Abstract
Background: Rapid and accurate identification of sickle cell disease (SCD) is crucial in emergency medicine because the chronic complications of SCD make clinical management profoundly different for SCD patients than for individuals without SCD or with sickle cell trait (SCT). However, differences in electronic health records (EHRs) between healthcare systems, and at times inaccurate information in the EHR (an estimated error rate of around 10% for SCD status), can make identifying SCD status unreliable. Comprehensive testing to diagnose SCD through hemoglobin electrophoresis or high-performance liquid chromatography (HPLC) can be time-consuming and expensive, and is not viable on the rapid timeframe needed for acute care. This diagnostic gap in emergency medicine is not unique to SCD; the recent development and FDA approval of machine learning (ML)-based clinical tools for identifying conditions such as arrhythmia and hemorrhage have established a precedent for the use of ML as a clinical tool. Following this trend, we developed SIGHT (Sickle Cell Identification from General Hematological Testing), a fast and accurate ML-based system that estimates the likelihood that an individual has SCD from a single, readily available complete blood count (CBC).
Methods: A dataset of 8,069 CBCs from adults was assembled from National Health and Nutrition Examination Survey (NHANES) data and clinical data from SCD visits to Grady Memorial Hospital in Atlanta, GA, from 2020 to 2025. Hemoglobin phenotypes derived from electrophoresis were used to identify SCD patients because they are more reliable than ICD codes. As the NHANES cohort did not include electrophoresis data, all individuals from this data source were classified as controls. After removal of samples with missing values, the final dataset comprised 8,062 samples (5.74% SCD cases) and was split into an 80% training set and a 20% testing set, stratified by SCD status to preserve class proportions. A scaling weight of 16.20 (the ratio of controls to cases) was applied to SCD cases during model training to compensate for class imbalance, given the low prevalence of SCD in the US (approximately 0.03%). The final model was built in R using the tidymodels and XGBoost frameworks, with 10-fold stratified cross-validation and hyperparameter tuning. Input features were red blood cell (RBC) count, total hemoglobin, hematocrit (HCT), mean corpuscular volume (MCV), mean corpuscular hemoglobin concentration (MCHC), and red cell distribution width (RDW). The final model was evaluated on the test set, and performance was assessed with a confusion matrix and derived statistics including accuracy, sensitivity, specificity, F₁-score, macro-averaged F₁-score, and log loss. Feature importance was quantified using SHAP (SHapley Additive exPlanations). An interactive web application was developed to provide predictions with individualized SHAP explanations.
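The scalar of 16.20 can be cross-checked against the sample counts given in this abstract. The sketch below is illustrative Python rather than the study's R code, and it assumes that the scalar is the number of controls per SCD case in the training split; the variable names are hypothetical, and the test-set counts are taken from the confusion matrix reported in the Results.

```python
# Counts taken from the abstract; the derivation itself is an assumption.
total_samples = 8062      # samples remaining after removing missing values
scd_fraction = 0.0574     # reported share of SCD cases

# Test-set composition implied by the reported confusion matrix
test_controls = 1517 + 8  # controls classified as control + controls classified as SCD
test_cases = 84 + 4       # cases classified as SCD + cases classified as control

total_cases = round(total_samples * scd_fraction)
total_controls = total_samples - total_cases

# Whatever is not in the test set is in the training set
train_cases = total_cases - test_cases
train_controls = total_controls - test_controls

# Weight for the minority (SCD) class: controls per case in the training set
class_weight = train_controls / train_cases

print(train_cases, train_controls, f"{class_weight:.2f}")  # → 375 6074 16.20
```

Under that assumption the training split holds 375 cases and 6,074 controls, which reproduces the reported value.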
Results: On the 1,613-sample test set, SIGHT achieved an accuracy of 99.3%, correctly identifying 1,517 controls and 84 SCD cases, with only 8 false positives and 4 false negatives. With SCD as the positive class, this corresponds to a sensitivity of 95.5% (84/88) and a specificity of 99.5% (1,517/1,525). The F₁-score was 0.933 for the SCD class and 0.996 for the control class, giving a macro-averaged F₁ of 0.965 and indicating few false positives and false negatives. Log loss was 0.027, equivalent to a geometric-mean predicted probability of about 0.97 for the true class. SHAP feature importance ranked MCHC as the most important predictor, followed by RDW, HCT, RBC count, MCV, and total hemoglobin. Performance was stable across cross-validation folds, with a mean CV AUROC of 0.929 ± 0.021 and a mean CV F₁-score of 0.977 ± 0.002.
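The summary statistics can be recomputed from the four confusion-matrix counts. The following is a minimal Python sketch for illustration (the study itself used R, and the variable names are hypothetical); it treats SCD as the positive class, which is an assumption about the intended clinical reading.

```python
import math

# Test-set confusion-matrix counts reported in the Results
tn = 1517  # controls classified as control
tp = 84    # SCD cases classified as SCD
fp = 8     # controls classified as SCD
fn = 4     # SCD cases classified as control


def f1(hits, false_alarms, misses):
    """F1 for one class: 2*hits / (2*hits + false alarms + misses)."""
    return 2 * hits / (2 * hits + false_alarms + misses)


accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity_scd = tp / (tp + fn)  # share of true SCD cases that were detected
specificity_scd = tn / (tn + fp)  # share of true controls that were cleared

f1_scd = f1(tp, fp, fn)
f1_control = f1(tn, fn, fp)       # same formula from the control class's viewpoint
macro_f1 = (f1_scd + f1_control) / 2

# A log loss L equals a geometric-mean probability exp(-L) on the true class
geo_mean_prob = math.exp(-0.027)

print(f"{accuracy:.3f}")                                 # → 0.993
print(f"{sensitivity_scd:.3f} {specificity_scd:.3f}")    # → 0.955 0.995
print(f"{f1_scd:.3f} {f1_control:.3f} {macro_f1:.3f}")   # → 0.933 0.996 0.965
print(f"{geo_mean_prob:.3f}")                            # → 0.973
```

Treating controls as the positive class instead would swap the sensitivity and specificity values, so the reference class should be stated alongside these metrics.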
Discussion: With its high accuracy and reliability, SIGHT is a promising clinical tool for SCD screening in acute care settings, helping to inform physician management and to indicate whether a patient should undergo additional testing to confirm an SCD diagnosis. It may be especially useful where EHRs are unreliable or unavailable, or where more comprehensive SCD testing is limited. In addition, SIGHT shows the user a visualization of how the input data contributed to the disease status identification, so it does not operate as a 'black box'. A limitation of the current implementation is that transfusion history was not included in the training data, and donor blood can sometimes mask the hematological markers of SCD. We will continue working to improve the model's accuracy in real-world use and to explore the integration of SIGHT into existing EHR and laboratory information systems.